A regular expression (or RE) specifies a set of strings that matches it; the functions in this module let you check if a particular string matches a given regular expression (or if a given regular expression matches a particular string, which comes down to the same thing).
Let's start with a hello world program (matching characters):
In [ ]:
# module for regular expressions
import re

match = re.search(r'rld', 'hello world')
if match:
    print(match.group())
else:
    print("No matching pattern")
In the above example, re.search(pat, str) searches for the pattern pat in the string str. If the search is successful, a match object is returned; otherwise it returns None. The code match.group() returns the matching text.
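Because re.search() returns None when nothing matches, calling group() on the result without checking first raises an AttributeError. Here is a minimal sketch of that check, wrapped in a helper called first_match; the helper name is hypothetical and is not part of the re module.

```python
import re

def first_match(pattern, text):
    """Return the text of the first match, or None if nothing matches.

    first_match is a hypothetical helper for illustration, not part of re.
    """
    match = re.search(pattern, text)
    if match is None:
        # No match: re.search() returned None, so there is no group() to call.
        return None
    return match.group()

found = first_match(r'rld', 'hello world')
missing = first_match(r'xyz', 'hello world')
print(found)    # rld
print(missing)  # None
```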
a, X, 9 -- ordinary characters just match themselves exactly. The meta-characters, which do not match themselves because they have special meanings, are: . ^ $ * + ? { } [ ] \ | ( )
. (a period) -- matches any single character except newline '\n'
\w -- (lowercase w) matches a "word" character: a letter or digit or underbar [a-zA-Z0-9_]. Note that although "word" is the mnemonic for this, it only matches a single word char, not a whole word.
\W -- (upper case W) matches any non-word character.
\b -- boundary between word and non-word
\s -- (lowercase s) matches a single whitespace character: space, newline, return, tab, form feed [ \n\r\t\f].
\S -- (upper case S) matches any non-whitespace character.
\t, \n, \r -- tab, newline, return
\d -- decimal digit [0-9]
^ = start, $ = end -- match the start or end of the string
\ -- inhibit the "specialness" of a character. So, for example, use \. to match a period or \\ to match a backslash. If you are unsure whether a character has special meaning, such as '@', you can put a backslash in front of it, \@, to make sure it is treated just as a character.
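To make escaping, word boundaries and anchors concrete, here is a small sketch. The sample string and variable names are made up for illustration, and it uses re.findall(), which returns every match as a list.

```python
import re

text = 'cat. concat cat'  # made-up sample string

# An unescaped dot matches any character, so it also picks up 'cat '.
any_char = re.findall(r'cat.', text)
# An escaped dot matches only a literal period.
literal_dot = re.findall(r'cat\.', text)
# \b keeps the 'cat' inside 'concat' from matching.
whole_words = re.findall(r'\bcat\b', text)
# ^ and $ anchor the pattern to the start and end of the string.
at_start = re.search(r'^cat', text)
at_end = re.search(r'cat$', text)

print(any_char)       # ['cat.', 'cat ']
print(literal_dot)    # ['cat.']
print(whole_words)    # ['cat', 'cat']
print(at_start.start(), at_end.start())  # 0 12
```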
Below are some examples:
In [ ]:
# . matches any character except '\n'
match = re.search(r'..o', 'hello pythonistas')
print(match.group())
In [ ]:
# \w matches any word character
match = re.search(r'\w\w', 'hello pythonistas')
print(match.group())
In [ ]:
# \d matches any digit character
match = re.search(r'\w\d\d', 'abc123')
print(match.group())
You can try out a few more examples not mentioned above.
In [ ]:
+ -- 1 or more occurrences of the pattern to its left, e.g. 'i+' = one or more i's
* -- 0 or more occurrences of the pattern to its left
? -- match 0 or 1 occurrences of the pattern to its left
Remember the rule "leftmost and largest": the search first finds the leftmost match for the pattern, and then tries to use up as much of the string as possible.
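As a quick illustration of the three repetition operators and of the leftmost rule, consider the following sketch. The sample strings and variable names are chosen only for this example.

```python
import re

# + needs at least one 'i' and then takes as many as it can.
plus = re.search(r'pi+', 'piiig').group()
# * also accepts zero i's, so a bare 'p' is enough.
star = re.search(r'pi*', 'pg').group()
# ? makes the 'u' optional.
optional = re.search(r'colou?r', 'color colour').group()
# Leftmost wins: the first run of i's is returned, not the longer run later on.
leftmost = re.search(r'i+', 'piigiiii').group()

print(plus)      # piii
print(star)      # p
print(optional)  # color
print(leftmost)  # ii
```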
Below are some examples:
In [ ]:
match = re.search(r'l+o', 'hellllllllllllllllo world')
print(match.group())
In [ ]:
# Note that search() stops at the first match, so the later 'lllllo' is not returned
# Leftmost and Largest
match = re.search(r'l+o', 'hellllllllloaaaalllllo world')
print(match.group())
In [ ]:
# \s* = zero or more whitespace chars
# Here look for 3 digits, possibly separated by whitespace.
match = re.search(r'\d\s*\d\s*\d', 'xx1 2 3xx')
print('first match', match.group())
match = re.search(r'\d\s*\d\s*\d', 'xx12 3xx')
print('second match', match.group())
match = re.search(r'\d\s*\d\s*\d', 'xx123xx')
print('third match', match.group())
In [ ]:
# Let's try to find an email address in a string
match = re.search(r'\w+@\w+', 'foo blah blah foo@bar.com')
print(match.group())
In the above example, the code returned foo@bar instead of foo@bar.com. The reason is that "." is not a word character, so \w+ stops before it.
Here comes the concept of square brackets:
Square brackets can be used to indicate a set of chars, so [abc] matches 'a' or 'b' or 'c'. The codes \w, \s etc. work inside square brackets too, with the one exception that dot (.) just means a literal dot. The next example uses a set to match the whole address:
In [ ]:
match = re.search(r'[\w.-]+@[\w.-]+', 'foo blah blah foo-foo@bar.com')
print(match.group())
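To see what the character set changes, this sketch runs the plain \w+ pattern and the bracketed pattern against the same string, and then checks that a dot inside brackets is literal. The variable names are chosen only for this example.

```python
import re

text = 'foo blah blah foo-foo@bar.com'

# \w+ alone stops at '-' on the left and at '.' on the right.
word_only = re.search(r'\w+@\w+', text).group()
# Adding '.' and '-' to the set lets the match cover the whole address.
with_set = re.search(r'[\w.-]+@[\w.-]+', text).group()

# Outside brackets a dot matches any character; inside brackets it is a literal dot.
bare_dot = re.findall(r'.', 'a.b')
set_dot = re.findall(r'[.]', 'a.b')

print(word_only)  # foo@bar
print(with_set)   # foo-foo@bar.com
print(bare_dot)   # ['a', '.', 'b']
print(set_dot)    # ['.']
```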
findall
findall() is probably the single most powerful function in the re module. Above we used re.search() to find the first match for a pattern. findall() finds all the matches and returns them as a list of strings, with each string representing one match.
In [ ]:
string = 'foo-bar@foo.com blah blah blah hello@world.com foo blah'
emails = re.findall(r'[\w.-]+@[\w.-]+', string)
for email in emails:
    print(email)
In [ ]:
# Suppose we want to have username and host separately
# for that we can use ()
emails = re.findall(r'([\w.-]+)@([\w.-]+)', string)
for email_tup in emails:
    print('username = ' + email_tup[0])
    print('host = ' + email_tup[1])
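The shape of the list that findall() returns depends on how many parenthesized groups the pattern contains. The sketch below reuses the sample string from above; the variable names are chosen only for this example.

```python
import re

text = 'foo-bar@foo.com blah blah blah hello@world.com foo blah'

# No groups: a list of whole matches.
whole = re.findall(r'[\w.-]+@[\w.-]+', text)
# One group: a list of strings holding just that group.
hosts = re.findall(r'[\w.-]+@([\w.-]+)', text)
# Two or more groups: a list of tuples, one tuple per match.
pairs = re.findall(r'([\w.-]+)@([\w.-]+)', text)

print(whole)  # ['foo-bar@foo.com', 'hello@world.com']
print(hosts)  # ['foo.com', 'world.com']
print(pairs)  # [('foo-bar', 'foo.com'), ('hello', 'world.com')]
```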
In [ ]:
from IPython.display import Image
Image(filename='files/regular_expressions.png')
# The same features are also available in Python
This was an introductory-level look at regular expressions. More details can be found at: http://docs.python.org/2/library/re.html http://docs.python.org/2/howto/regex.html